Correctness evaluator #969
base: main
Conversation
Some time ago, I worked on a project very similar to this one. I found it very useful to include the reason in the response. You can see an example here:
ValidatorAgent.java
I believe using a score to evaluate the answers might not be the best option, as the expected output for a test is typically a boolean (test passes or not). Using a score could make it harder to properly test an answer because the developer would need to decide on a threshold, which could be very arbitrary.
> This returns both a pass/fail boolean and an explanation. The pass/fail is determined by the score threshold, such that if the score is below a certain threshold, then the test fails. And the explanation is provided in the […]

I believe it makes more sense to rely on the judgment of the LLM to determine if the test passes or not. However, I've seen many examples where evaluations are based on scores, so I might be wrong. If you always want the explanation included in the response, I suggest being more explicit in the prompt. Currently, you are asking for: "Output a single score that represents a holistic evaluation." I believe this could lead the LLM to omit the explanation. When I was working on this, I realized that I had to be extremely explicit about what I wanted.
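As a rough illustration (hypothetical names, not taken from this PR or from ValidatorAgent.java), a boolean-plus-reason result with a deliberately explicit prompt might look like this:

```java
// Hypothetical sketch (not part of this PR): ask the model for an explicit
// pass/fail verdict plus the reason behind it, instead of a single score.
public record CorrectnessVerdict(boolean passed, String reason) {

	// A deliberately explicit prompt, so the model cannot silently drop
	// either the verdict or the explanation.
	public static final String PROMPT = """
			You are an expert evaluator. Compare the generated answer to the
			reference answer and decide whether the generated answer is correct.
			Respond with exactly two lines, in this order:
			PASSED: true or false
			REASON: a one-sentence explanation of your verdict
			""";
}
```

The evaluator can then parse the PASSED/REASON lines and return both values, with no threshold for the developer to tune.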
This PR adds a new evaluator to judge the correctness of a response. It is based loosely on LlamaIndex's correctness.py evaluator.

Note that I couldn't decide the best way to provide the reference answer. I saw three choices:

1. Add the reference answer to EvaluationRequest. This seemed like the worst choice, as it would be a property that is probably only used for this evaluator.
2. Change the evaluate() method to take an EvaluationRequest as well as the reference answer. I started this way, but it felt clunky.
3. Subclass EvaluationRequest with CorrectnessEvaluationRequest and implement evaluate() to check for a CorrectnessEvaluationRequest and use its reference answer. This is the option I chose (a rough sketch follows this list).

Also note that this change is built upon the change in #967, such that it takes a String for the response in EvaluationRequest.
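For illustration only, option 3 could be sketched roughly as below; the class name follows the description above, but the EvaluationRequest constructor used in super(...) is an assumption, not necessarily what the PR implements:

```java
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.evaluation.EvaluationRequest;

// Hypothetical sketch of the chosen option: a request subclass that carries the
// reference answer, so the evaluate(EvaluationRequest) signature stays unchanged.
// The super(...) constructor arguments are assumed here for illustration only.
public class CorrectnessEvaluationRequest extends EvaluationRequest {

	private final String referenceAnswer;

	public CorrectnessEvaluationRequest(String userText, List<Document> dataList,
			String responseContent, String referenceAnswer) {
		super(userText, dataList, responseContent);
		this.referenceAnswer = referenceAnswer;
	}

	public String getReferenceAnswer() {
		return this.referenceAnswer;
	}

}
```

Inside evaluate(), the evaluator can then check the request type, e.g. `if (request instanceof CorrectnessEvaluationRequest correctnessRequest)`, and use the reference answer when it is present.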